Automatic Long Audio Alignment and Confidence Scoring for Conversational Arabic Speech
Abstract
This paper proposes a framework for long audio alignment of conversational Arabic speech. Accurate alignments support many speech processing tasks, such as audio indexing, acoustic model (AM) training for speech recognizers, and audio summarization and retrieval. We collected more than 1,400 hours of conversational Arabic together with the corresponding human-generated, non-aligned transcriptions. Automatic audio segmentation is performed using a split-and-merge approach. A biased language model (LM) is trained on the corresponding text after a pre-processing stage. Because non-standard Arabic dominates conversational speech, a graphemic pronunciation model (PM) is used. The proposed alignment approach runs in two passes. In the first pass, a generic standard Arabic AM is used with the biased LM and the graphemic PM for fast speech recognition. In the second pass, a more restricted LM is generated for each audio segment, and unsupervised acoustic model adaptation is applied. The recognizer output is aligned with the processed transcriptions using the Levenshtein algorithm. The proposed approach yields an initial alignment accuracy of 97.8-99.0%, depending on the amount of disfluencies. A confidence scoring metric is proposed to accept or reject the aligner output. Using the confidence scores, it was possible to reject the majority of misaligned segments, resulting in an alignment accuracy of 99.0-99.8%, depending on the speech domain and the amount of disfluencies.
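The Levenshtein alignment step mentioned in the abstract can be illustrated with a minimal word-level sketch. This is an assumption-laden illustration, not the authors' implementation: the function names `align_words` and `match_ratio` are hypothetical, unit edit costs are assumed, and the simple match ratio is only a stand-in and is not the confidence scoring metric proposed in the paper (which the abstract does not define).

```python
from typing import List, Optional, Tuple

Pair = Tuple[Optional[str], Optional[str]]


def align_words(ref: List[str], hyp: List[str]) -> Tuple[int, List[Pair]]:
    """Word-level Levenshtein alignment with unit costs (an assumption).

    ref: words of the processed transcription.
    hyp: words of the recognizer output.
    Returns the edit distance and a list of (ref_word, hyp_word) pairs,
    where None marks a word that has no counterpart on the other side.
    """
    n, m = len(ref), len(hyp)
    # dist[i][j] = edit distance between ref[:i] and hyp[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dist[i][j] = min(sub, dist[i - 1][j] + 1, dist[i][j - 1] + 1)

    # Backtrace from the bottom-right corner to recover the aligned pairs.
    pairs: List[Pair] = []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0
                and dist[i][j] == dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])):
            pairs.append((ref[i - 1], hyp[j - 1]))  # match or substitution
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            pairs.append((ref[i - 1], None))  # word missing from recognizer output
            i -= 1
        else:
            pairs.append((None, hyp[j - 1]))  # extra word in recognizer output
            j -= 1
    pairs.reverse()
    return dist[n][m], pairs


def match_ratio(pairs: List[Pair]) -> float:
    """Fraction of transcription words matched exactly (illustrative only)."""
    ref_words = sum(1 for r, _ in pairs if r is not None)
    matches = sum(1 for r, h in pairs if r is not None and r == h)
    return matches / ref_words if ref_words else 0.0
```

For example, calling `align_words("a b c d".split(), "a x c".split())` pairs each transcription word with its recognized counterpart, and a segment whose `match_ratio` falls below some chosen threshold could then be flagged for rejection or manual review.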
Similar resources
A Framework for Conversational Arabic Speech Long Audio Alignment
We propose a framework for long audio alignment for conversational Arabic speech. Accurate alignments help in many speech processing tasks such as audio indexing, speech recognizer acoustic model (AM) training, and audio summarization and retrieval. In this work, we have collected more than 1400 hours of conversational Arabic along with the corresponding non-aligned text transcriptions. Automatic ...
Speech recognition based confidence measures for building voices from untranscribed speech
Today, a large amount of audio data is available on the web in the form of audiobooks, podcasts, video lectures, video blogs, and news bulletins. In addition, we can effortlessly record and store audio data such as read/lecture/impromptu speech on hand-held devices. These data are rich in prosody, provide a plethora of voices to choose from, and their availability can significantly reduce the overhea...
Technique for automatic sentence level alignment of long speech and transcripts
A frugal approach to constructing speech corpora, especially for resource-deficient languages, is to exploit collections of speech and corresponding text data available in audio books, news, and lectures. However, using these resources for building speech corpora requires an alignment of the long duration speech data with the accompanying text data. Existing techniques for automatic speech-text alignmen...
The EPAC Corpus: Manual and Automatic Annotations of Conversational Speech in French Broadcast News
This paper presents the EPAC corpus, which is composed of a set of 100 hours of manually transcribed conversational speech and of the outputs of automatic tools (automatic segmentation, transcription, POS tagging, etc.) applied to the entire French ESTER 1 audio corpus: this concerns about 1700 hours of audio recordings from radio shows. This corpus was built during the EPAC project funded...
Automatic Phonetic Transcription in Two Steps: Forced Alignment and Burst Detection
In the last decade, there has been growing interest in conversational speech in the fields of human and automatic speech recognition. Whereas for the varieties spoken in Germany both resources and tools are numerous, for Austrian German the first corpus of read and conversational speech was collected only recently. In the current paper, we present automatic methods to phonetically transcribe and ...